This project introduces the hypothesis that countries grouped by a set of socioeconomic and demographic variables also cluster geographically, and that this relationship persists over time. We are also interested in determining which of these variables are most relevant, in the sense of capturing the largest share of the variability in the data. This is useful to any analyst because it enables dimensionality reduction as a step prior to building additional models (possibly supervised ones, if that is the goal). Since we have panel data for our countries, we will also study any potentially relevant changes that took place between the 1980s and the early 2010s.
What are the underlying patterns and structures in the relationships between various socio-economic variables, such as GDP growth rate, school years, and life expectancy, across different countries?
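As a rough illustration of what capturing the largest share of variability means in practice, the following minimal sketch runs a principal component analysis with base R's prcomp() on the built-in USArrests data set. That data set is only a stand-in chosen for this example. It is not the country panel used in this project, and the approach is an assumption about how such a reduction could be done, not necessarily the method applied later.

```r
# Illustration only: USArrests ships with R and stands in for the country data.
pca <- prcomp(USArrests, scale. = TRUE)  # standardise so no variable dominates

# Share of total variance captured by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)

# Cumulative share: how many components are needed to keep most of the variability
cumsum(var_share)
```

Reading the cumulative shares from left to right shows how many components would be enough to retain a chosen fraction of the variability.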
Our point of departure is to load the libraries required for the analysis. janitor and forcats help with column-name cleaning and factor manipulation, respectively. The mice library will allow us to impute missing values automatically at later stages of the project. The cluster and mclust libraries will be used for the cluster analysis in the second section of the project.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(stringr)
library(forcats)
library(gganimate)
library(mice)
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
##
## The following object is masked from 'package:purrr':
##
## map
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:lubridate':
##
## %--%, union
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
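The startup messages above show that several functions are masked once all packages are attached (for example, mclust masks purrr's map, and igraph masks a few dplyr and tidyr functions). One defensive habit, sketched below on made-up toy data, is to call such functions with an explicit namespace so that the result does not depend on the order in which libraries were loaded.

```r
library(dplyr)

# Made-up rows, only to demonstrate the calls
toy <- data.frame(
  Continent = c("Europe", "Africa", "Europe"),
  Year      = c(1990, 1990, 2000)
)

# Explicit namespaces make it unambiguous which implementation is used
europe_1990 <- dplyr::filter(toy, Continent == "Europe", Year == 1990)
doubled     <- purrr::map(1:3, function(x) x * 2)
```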
To bring the data together, several sources had to be merged into a single data frame. The pre-cleaning, string manipulation and recoding involved are documented below. For the reader's convenience, we provide this step as a non-executable code chunk and work with the final version of the data in our statistical analysis:
# Read and clean data -----------------------------------------------------
gdp_growth <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/API_NY.GDP.PCAP.KD.ZG_DS2_en_csv_v2_4770505.csv",
skip = 4) %>%
clean_names() %>%
pivot_longer(starts_with("x"),
names_to = "Year",
values_to = "Gdppc_growth") %>%
select(-contains(c("indicator","code"))) %>%
mutate(Year = as.numeric(str_remove(Year,"x")))
pop_growth <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/API_SP.POP.GROW_DS2_en_csv_v2_4770493.csv",
skip = 4) %>%
clean_names() %>%
pivot_longer(starts_with("x"),
names_to = "Year",
values_to = "Pop_growth") %>%
select(-contains(c("indicator","code"))) %>%
mutate(Year = as.numeric(str_remove(Year,"x")))
continent <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/continents-according-to-our-world-in-data.csv") %>%
select(Entity,Continent)
school_years <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/mean-years-of-schooling-long-run.csv")
names(school_years)[4] <- "School_years"
school_years <- school_years %>%
select(-Code)
educ_expend <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/total-government-expenditure-on-education-gdp.csv")
names(educ_expend)[4] <- "Educ_expend"
educ_expend <- educ_expend %>%
select(-Code)
educ_expend
health <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/life-expectancy-vs-healthcare-expenditure.csv")
names(health)[c(4,5)] <- c("Life_expec","health_expdpc")
health <- health %>%
select(1:5, -2)
migration <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/migration.csv") %>%
select(contains("Net"), Year, Country) %>%
select(c(1,3,4)) %>%
clean_names()
Age <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/median-age.csv") %>%
select(c(1,3,4))
names(Age)[3] <- "Median_age"
marriage <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/marriage-rate-per-1000-inhabitants.csv") %>%
select(c(1,3,4))
names(marriage)[3] <- "marriage_per_1000"
civil_rights <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/civil-liberties-fh.csv") %>%
select(c(1,3,4))
names(civil_rights)[3] <- "civil_rights"
#Recall 1 is best and 7 is worst
geography <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/average-latitude-longitude-countries.csv") %>%
select(c(2,3,4))
religion <- read.csv("C:/Users/Asus/OneDrive/Desktop/Data TFM/religion.txt",
sep = "") %>%
select(Country, Feel) %>%
rename("relig_feel" = "Feel")
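The two World Bank files above (GDP per capita growth and population growth) go through an identical reshaping pipeline. As a minimal sketch, that shared logic could be wrapped in a helper. The name wb_to_long and the toy table are hypothetical, and the toy table assumes the column names have already been passed through clean_names().

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical helper: reshape a World Bank-style wide table into long format.
wb_to_long <- function(wide, value_name) {
  wide %>%
    pivot_longer(starts_with("x"),
                 names_to = "Year",
                 values_to = value_name) %>%
    select(-contains(c("indicator", "code"))) %>%
    mutate(Year = as.numeric(str_remove(Year, "x")))
}

# Made-up miniature of a cleaned World Bank file
toy <- data.frame(
  country_name   = c("Aruba", "Spain"),
  country_code   = c("ABW", "ESP"),
  indicator_name = "GDP per capita growth (annual %)",
  indicator_code = "NY.GDP.PCAP.KD.ZG",
  x1960          = c(NA, 2.1),
  x1961          = c(1.5, 3.4)
)

toy_long <- wb_to_long(toy, "Gdppc_growth")
toy_long
```

The same helper could then be called once per indicator file, changing only the value column name.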
At this stage one can already get a taste of what the final version of the data contains. Notice that we combine socio-economic variables with demographic ones. Any unnecessary or empty columns were removed, and because each row holds yearly information, the data also has a time dimension.
We merge all this information together:
# Merging datasets --------------------------------------------------------
df <- gdp_growth %>%
left_join(pop_growth,
by = c("country_name","Year")) %>%
left_join(continent,
by = c("country_name" = "Entity")) %>%
left_join(school_years,
by = c("country_name" = "Entity","Year")) %>%
left_join(educ_expend,
by = c("country_name" = "Entity","Year")) %>%
left_join(health,
by = c("country_name" = "Entity","Year")) %>%
left_join(migration,
by = c("country_name" = "country","Year" = "year")) %>%
left_join(Age,
by = c("country_name" = "Entity","Year")) %>%
left_join(marriage,
by = c("country_name" = "Entity","Year")) %>%
left_join(civil_rights,
by = c("country_name" = "Entity","Year"))%>%
left_join(geography,
by = c("country_name" = "Country"))%>%
left_join(religion,
by = c("country_name" = "Country"))
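Because left_join() keeps unmatched rows and fills them with NA, spelling differences between sources can go unnoticed. A minimal sketch of a check that could be run before each join is shown below. The two toy tables are made up for illustration, with names chosen to mimic the kind of mismatch that requires recoding.

```r
library(dplyr)

# Made-up miniature tables with one deliberately mismatched country name
left <- data.frame(
  country_name = c("Spain", "Russian Federation", "France"),
  Year         = 1990,
  Gdppc_growth = c(3.1, -3.0, 2.4)
)
right <- data.frame(
  Entity    = c("Spain", "Russia", "France"),
  Continent = "Europe"
)

# Rows of `left` that find no partner in `right`
unmatched <- anti_join(left, right, by = c("country_name" = "Entity"))
unmatched$country_name
```

Any name listed by this check is a candidate for recoding in one of the two sources before merging.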
Now we have the information for each of our economies, by year, in a single table. However, we still have to perform several transformations and devote some time to inspection and feature engineering before we can work with this data (we have NAs, categorical variables, years without information, etc.). Let's read our final data and have a quick scan through the variables:
df <- read.csv("Country_data.csv")
head(df)
## X country_name Year Gdppc_growth Pop_growth Continent School_years
## 1 1 Aruba 1960 NA NA North America NA
## 2 2 Aruba 1961 NA 2.179059 North America NA
## 3 3 Aruba 1962 NA 1.548572 North America NA
## 4 4 Aruba 1963 NA 1.389337 North America NA
## 5 5 Aruba 1964 NA 1.215721 North America NA
## 6 6 Aruba 1965 NA 1.032841 North America NA
## Educ_expend Life_expec health_expdpc net_migration_rate Median_age
## 1 NA 65.662 NA 11.371 17.3
## 2 NA 66.074 NA NA 17.3
## 3 NA 66.444 NA NA 17.4
## 4 NA 66.787 NA NA 17.4
## 5 NA 67.113 NA NA 17.5
## 6 NA 67.435 NA -15.499 17.6
## marriage_per_1000 civil_rights Latitude Longitude relig_feel
## 1 NA NA 12.5 -69.97 NA
## 2 NA NA 12.5 -69.97 NA
## 3 NA NA 12.5 -69.97 NA
## 4 NA NA 12.5 -69.97 NA
## 5 NA NA 12.5 -69.97 NA
## 6 NA NA 12.5 -69.97 NA
We have information on the country name, the year, GDP per capita and population growth rates, the continent, mean years of schooling, expenditure on education, life expectancy, health expenditure, net migration rate, median population age, marriage rate per 1,000 inhabitants, an ordinal civil rights score (1 is best, 7 is worst), geographical coordinates and religious feeling. In order to gain further insight into the variables taking part in the exploration and their hidden relationships, we will perform a descriptive analysis.
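Since the preview shows many NA values, it can help to quantify missingness per variable before plotting. The following is a minimal base R sketch on a toy data frame whose first rows imitate the preview; with the real data, the same expression could be applied to df directly.

```r
# Toy rows imitating the preview; relig_feel is entirely missing here
toy <- data.frame(
  country_name = c("Aruba", "Aruba", "Aruba"),
  Year         = c(1960, 1961, 1962),
  Pop_growth   = c(NA, 2.179059, 1.548572),
  Life_expec   = c(65.662, 66.074, 66.444),
  relig_feel   = c(NA, NA, NA)
)

# Share of missing values per column, most incomplete first
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_share
```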
We begin with a boxplot of median age, faceted by continent, and add a time dimension by distinguishing the periods up to and after 1990. The median age is preferred over the mean age because it is more robust to outliers while still describing the centre of the distribution.
df %>%
drop_na(Median_age,Continent) %>%
ggplot()+
aes(Median_age, fill = Year >1990)+
geom_boxplot()+
facet_wrap(~Continent, scales = "free")+
theme_dark()+
theme(panel.grid = element_blank(),
strip.text = element_text(size = 10, face = "bold"))+
labs(x = "Median Age", fill = "Time after 1990")
One notices a clear upward shift in median age over time, especially in the Americas and Europe. This reflects population ageing. In Africa, by contrast, the difference is barely noticeable, and median age has remained stagnant across both periods. This tells us a great deal about the population structures of the different continents.
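The shift seen in the boxplots can also be read numerically. The sketch below computes the median of Median_age by continent and period on made-up values; the numbers are purely illustrative and are not taken from the project data.

```r
library(dplyr)

# Made-up values for two continents, two years in each period
toy <- data.frame(
  Continent  = rep(c("Africa", "Europe"), each = 4),
  Year       = rep(c(1980, 1985, 2000, 2005), times = 2),
  Median_age = c(17.0, 17.2, 17.5, 17.9, 31.0, 32.5, 37.0, 39.0)
)

# One row per continent and period, mirroring the fill used in the boxplot
shift <- toy %>%
  group_by(Continent, after_1990 = Year > 1990) %>%
  summarise(median_age = median(Median_age), .groups = "drop")
shift
```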
Now we will relate our civil rights variable to the importance of religion in each economy. To take advantage of the time dimension, as in the previous plot, we again split the period in two (up to and after 1990) and draw our conclusions:
df %>%
drop_na(relig_feel, civil_rights) %>%
ggplot()+
aes(civil_rights,relig_feel, fill = Year > 1990)+
geom_bar(stat = "identity")+
facet_wrap(~Year>1990, scales = "free")+
theme_dark()+
theme(panel.grid = element_blank(),
strip.text = element_blank(),
plot.title = element_text(hjust = 0.5))+
labs(x = "Civil Rights", fill = "Time after 1990",
y = "Religion Importance",
title = "More religious economies are likely \n to score worse in civil liberties")+
scale_fill_discrete(labels=c('Before 1990', 'After 1990'))
This graph conveys two main pieces of information. First, the economies with the best civil rights scores (1, 2 or 3) are usually the ones that attach less importance to religion (shortest bars in both periods). As religiousness increases, it becomes more likely to find economies whose civil rights score is worse than 3. Second, some religious economies are starting to improve in terms of civil liberties (note how score 4 now has the tallest bar).
Next, we plot the relationship between years of schooling and life expectancy. Again, we split the observations into years before 1990 and years from 1990 onwards. We also facet by continent to get a better understanding of the information:
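One caveat when reading bar heights: geom_bar(stat = "identity") stacks one segment per row, so a bar can be tall either because religiousness is high or because many country-years share that score. A minimal sketch on made-up values (relig_feel is on an arbitrary scale here) shows how the stacked total and the per-score mean can rank the scores differently.

```r
library(dplyr)

# Made-up observations: score 4 has more rows than score 7
toy <- data.frame(
  civil_rights = c(1, 1, 1, 4, 4, 7),
  relig_feel   = c(0.2, 0.3, 0.1, 0.8, 0.6, 0.9)
)

summary_tbl <- toy %>%
  group_by(civil_rights) %>%
  summarise(
    n          = n(),
    total      = sum(relig_feel),   # what the stacked bar height shows
    mean_relig = mean(relig_feel)   # average religiousness per score
  )
summary_tbl
```

In this toy example the stacked total is largest for score 4, while the mean is largest for score 7, so plotting the mean would be a useful cross-check of the interpretation.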
df %>%
drop_na(Continent,Year) %>%
ggplot()+
aes(School_years,Life_expec, color = Year<1990)+
geom_point()+
facet_wrap(~Continent)+
theme_dark()+
labs(x = "Years of schooling",
y = "Life expectancy",
color = "Time before 1990")+
theme(strip.text = element_text(size = 10, face = "bold"))
## Warning: Removed 6746 rows containing missing values (`geom_point()`).
There appears to be a strong positive relationship between years of schooling and life expectancy, with diminishing returns in regions like Asia or North America. That is, the relationship is concave there (each additional year of schooling is associated with a smaller gain), whereas in Europe or Oceania it looks roughly linear with a constant slope. In Africa before the 1990s the slope is very steep, which suggests that additional schooling in those years was associated with large gains in life expectancy and living conditions for African economies.
Now we will use the geographical information in our data to map the religious variable across Europe:
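A visual impression of concavity can be checked with a simple quadratic fit: a negative coefficient on the squared term is consistent with diminishing returns. The sketch below uses synthetic data generated from a known concave curve, so the numbers say nothing about the real panel; with the project data, the same model could be fitted separately per continent.

```r
# Synthetic data generated from a known concave relationship
school <- 1:12
toy <- data.frame(
  School_years = school,
  Life_expec   = 40 + 6 * school - 0.3 * school^2
)

# Quadratic fit; I() protects the power inside the formula
fit  <- lm(Life_expec ~ School_years + I(School_years^2), data = toy)
quad <- coef(fit)[["I(School_years^2)"]]
quad  # negative sign is consistent with diminishing returns
```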
region <- df %>%
filter(Continent == "Europe", Year == 1990) %>%
select(country_name,relig_feel,civil_rights) %>%
distinct()
map_data("world") %>%
inner_join(region, by = c("region" = "country_name")) %>%
ggplot(aes(long,lat))+
geom_polygon(aes( group = group, fill = relig_feel))+
theme_dark()+
theme(axis.title = element_blank())+
labs(title = "Religion Importance in Europe")
This graph is interesting because it gives the researcher a quick intuition of which European countries attach the most importance to religion. The Nordic countries are the least religious, whereas Poland, Romania and Italy occupy the top positions. Spain sits somewhere between these poles: more religious than France, but much less so than Portugal.
Since we have temporal data, let's make use of it and add some dynamics to our plot. We now turn to public health expenditure and life expectancy by continent. We will use the gganimate library to depict this information and add some transparency to the points to reduce overplotting. Rendering takes a while, so let's be patient!
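One practical detail of the map: the inner_join() silently drops any country whose name in the data differs from the region name used by map_data("world"), which, for instance, labels the United Kingdom as "UK". A minimal sketch of a recoding step is shown below; the values and the map_name column are made up for illustration.

```r
library(dplyr)

# Made-up values; only the naming mismatch matters here
region <- data.frame(
  country_name = c("United Kingdom", "Spain", "Poland"),
  relig_feel   = c(0.3, 0.5, 0.8)
)

# Hypothetical extra column holding the name expected by the map data
region_fixed <- region %>%
  mutate(map_name = recode(country_name, "United Kingdom" = "UK"))
region_fixed
```

The join against the map data would then use by = c("region" = "map_name") instead of the original country name.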
df %>%
drop_na(Continent, Life_expec,health_expdpc) %>%
ggplot()+
aes(health_expdpc, Life_expec, fill = Continent)+
geom_point(shape = 21, color = "black", size = 8, alpha = 0.5)+
facet_wrap(~Continent, scales = "free")+
transition_time(Year)+
labs(subtitle = "{frame_time}")+
theme_dark()+
theme(strip.text = element_text(size = 10, face = "bold"))
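Rendering time depends largely on the number of frames. The following minimal sketch, on made-up data for two hypothetical countries, shows how an animation of this kind could be rendered with fewer frames and saved to disk. The group aesthetic, the rounded frame label, the frame settings and the file name are assumptions of this example, not part of the original chunk.

```r
library(ggplot2)
library(gganimate)

# Made-up values for two hypothetical countries over three years
toy <- data.frame(
  country_name  = rep(c("A", "B"), each = 3),
  Year          = rep(c(2000, 2005, 2010), times = 2),
  health_expdpc = c(100, 150, 220, 900, 1100, 1400),
  Life_expec    = c(55, 58, 61, 76, 78, 80)
)

# group = country_name tells gganimate which point to follow across years
p <- ggplot(toy, aes(health_expdpc, Life_expec, group = country_name)) +
  geom_point(size = 8, alpha = 0.5) +
  transition_time(Year) +
  labs(subtitle = "Year: {round(frame_time)}")

# Fewer frames render faster; writing a GIF requires the gifski package
if (requireNamespace("gifski", quietly = TRUE)) {
  anim <- animate(p, nframes = 30, fps = 10, renderer = gifski_renderer())
  anim_save(file.path(tempdir(), "health_vs_life_expectancy.gif"), animation = anim)
}
```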